fix: route group coordination requests to a single broker to prevent split brain #125

Merged
novatechflow merged 1 commit into KafScale:main from klaudworks:fix/group-lease-routing
Mar 2, 2026

@klaudworks
Collaborator

Summary

Closes #121.

When multiple consumers in the same group connect through different brokers, each broker independently tries to coordinate the group. This causes split brain: both brokers assign partitions, neither knows about the other's assignments, and messages get processed multiple times with no errors or warnings.

This PR fixes that by making brokers acquire an etcd lease before coordinating a group. Only the lease holder can coordinate — all other brokers reject the request with NOT_COORDINATOR, and the proxy retries on the correct broker.

What changed

Broker — Before handling any group operation (join, sync, heartbeat, leave, offset commit/fetch, describe), the broker tries to acquire an etcd lease for that group. If another broker already holds it, the request is rejected with NOT_COORDINATOR. If etcd is unreachable, the request is rejected with REQUEST_TIMED_OUT instead of silently proceeding (which was the old behavior that allowed two brokers to coordinate the same group simultaneously).
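The lease gate described above can be sketched roughly as follows. This is an illustrative model, not the code from this PR: `FakeLeaseStore`, `StoreUnavailable`, `Broker.handle_group_request`, and the `groups/<id>` key layout are hypothetical names, and an in-memory dictionary stands in for etcd. The error code values 7 and 16 are the standard Kafka codes for REQUEST_TIMED_OUT and NOT_COORDINATOR. Lease takeover after TTL expiry is an assumption about how the lease behaves.

```python
import time

# Standard Kafka protocol error codes.
NONE = 0
REQUEST_TIMED_OUT = 7
NOT_COORDINATOR = 16


class StoreUnavailable(Exception):
    """Raised by the stand-in store to simulate etcd being unreachable."""


class FakeLeaseStore:
    """In-memory stand-in for etcd: at most one owner per key, with a TTL."""

    def __init__(self):
        self._owners = {}  # key -> (broker_id, expires_at)
        self.available = True

    def acquire(self, key, broker_id, ttl, now=None):
        """Take the lease if it is free, expired, or already ours.

        Returns the id of whoever owns the lease after the attempt.
        """
        if not self.available:
            raise StoreUnavailable(key)
        now = time.monotonic() if now is None else now
        owner = self._owners.get(key)
        if owner is None or owner[1] <= now or owner[0] == broker_id:
            self._owners[key] = (broker_id, now + ttl)
            return broker_id
        return owner[0]


class Broker:
    def __init__(self, broker_id, store, ttl=10.0):
        self.broker_id = broker_id
        self.store = store
        self.ttl = ttl

    def handle_group_request(self, group_id, now=None):
        """Return the error code this broker would answer with."""
        try:
            owner = self.store.acquire(
                f"groups/{group_id}", self.broker_id, self.ttl, now
            )
        except StoreUnavailable:
            # Fail closed: never coordinate without a confirmed lease.
            return REQUEST_TIMED_OUT
        if owner != self.broker_id:
            return NOT_COORDINATOR
        return NONE  # lease held; safe to run join, sync, heartbeat, ...
```

The important design choice is the `except` branch: an unreachable store yields an error to the client rather than a fall-through to normal coordination, which is the failure mode the summary identifies as the cause of the split brain.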

Proxy — The proxy watches etcd for group lease ownership and routes group requests to the owning broker. If the broker responds with NOT_COORDINATOR, the proxy invalidates its cache and retries on a different broker. The old detection logic used hard-coded byte offsets that broke on newer protocol versions and produced false positives on normal data. This is replaced with proper response parsing. Multi-group DescribeGroups requests are forwarded once without retry since different groups may live on different brokers — the client handles per-group errors natively.
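A minimal sketch of the routing and retry behavior, under stated assumptions: `GroupRouter`, `owner_cache`, and the `send` callback are hypothetical names, the cache is filled by hand here rather than by an etcd watch, and the attempt limit of 3 is arbitrary. It shows the two retry triggers named in this PR: a NOT_COORDINATOR response and a connection failure.

```python
NONE = 0
NOT_COORDINATOR = 16  # standard Kafka error code


class GroupRouter:
    """Sends group requests to the cached lease owner and retries elsewhere."""

    def __init__(self, brokers, send, max_attempts=3):
        self.brokers = list(brokers)  # known broker ids
        # send(broker_id, group_id) -> error code; may raise ConnectionError
        self.send = send
        self.max_attempts = max_attempts
        self.owner_cache = {}  # group_id -> broker_id

    def _pick(self, group_id, tried):
        """Prefer the cached owner, then any broker not yet tried."""
        owner = self.owner_cache.get(group_id)
        if owner is not None and owner not in tried:
            return owner
        for broker in self.brokers:
            if broker not in tried:
                return broker
        return None

    def route(self, group_id):
        """Return (broker_id, error_code) for the broker that answered."""
        tried = set()
        last_error = None
        for _ in range(self.max_attempts):
            broker = self._pick(group_id, tried)
            if broker is None:
                break
            tried.add(broker)
            try:
                code = self.send(broker, group_id)
            except ConnectionError as exc:
                # A dead broker is a transient failure: drop the cache entry
                # and try another broker instead of aborting.
                self.owner_cache.pop(group_id, None)
                last_error = exc
                continue
            if code == NOT_COORDINATOR:
                # Stale ownership information: invalidate and retry.
                self.owner_cache.pop(group_id, None)
                last_error = code
                continue
            if code == NONE:
                self.owner_cache[group_id] = broker
            return broker, code
        raise RuntimeError(f"no coordinator found for {group_id!r}: {last_error!r}")
```

In this sketch a stale cache entry costs one extra round trip: the first attempt hits the old owner, the entry is dropped, and the next attempt goes to a broker that has not been tried yet.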

Lease manager — The partition and group lease managers shared identical logic for session management, acquire/release, and etcd transactions. Both now delegate to a shared generic LeaseManager with thin type-safe wrappers on top.
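The shape of that refactor might look something like the sketch below. It is not the PR's code: the class and method names other than `LeaseManager` are hypothetical, the key prefixes are invented, and a plain dictionary replaces etcd sessions and transactions. The point is only that the acquire/release logic lives in one generic class while each wrapper contributes a key format and a typed signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)


class LeaseManager(Generic[K]):
    """Shared acquire/release logic, parameterised by the resource type."""

    def __init__(self, owner_id: str, key_fn: Callable[[K], str],
                 store: Dict[str, str]) -> None:
        self.owner_id = owner_id
        self._key_fn = key_fn
        self._store = store  # stand-in for etcd: key -> owner id

    def acquire(self, resource: K) -> bool:
        # setdefault models "create the key only if it does not exist".
        current = self._store.setdefault(self._key_fn(resource), self.owner_id)
        return current == self.owner_id

    def release(self, resource: K) -> None:
        key = self._key_fn(resource)
        if self._store.get(key) == self.owner_id:  # only the owner may release
            del self._store[key]

    def owner(self, resource: K) -> Optional[str]:
        return self._store.get(self._key_fn(resource))


@dataclass(frozen=True)
class TopicPartition:
    topic: str
    partition: int


class PartitionLeaseManager:
    """Thin typed wrapper: callers pass a topic and partition, not a raw key."""

    def __init__(self, owner_id: str, store: Dict[str, str]) -> None:
        self._inner: LeaseManager[TopicPartition] = LeaseManager(
            owner_id, lambda tp: f"leases/partitions/{tp.topic}/{tp.partition}", store
        )

    def acquire(self, topic: str, partition: int) -> bool:
        return self._inner.acquire(TopicPartition(topic, partition))

    def release(self, topic: str, partition: int) -> None:
        self._inner.release(TopicPartition(topic, partition))


class GroupLeaseManager:
    """Thin typed wrapper keyed by consumer group id."""

    def __init__(self, owner_id: str, store: Dict[str, str]) -> None:
        self._inner: LeaseManager[str] = LeaseManager(
            owner_id, lambda group_id: f"leases/groups/{group_id}", store
        )

    def acquire(self, group_id: str) -> bool:
        return self._inner.acquire(group_id)

    def release(self, group_id: str) -> None:
        self._inner.release(group_id)
```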

Preexisting bugs fixed along the way

  • Transient etcd errors during group lease checks were silently ignored, allowing two brokers to coordinate the same group. Now returns REQUEST_TIMED_OUT.
  • Proxy failed to detect NOT_COORDINATOR on newer protocol versions (flexible headers shifted byte offsets). Replaced with proper response parsing.
  • Proxy reported false positives on normal response data that happened to contain the bytes for error code 16 (e.g., partition number 16, offset value 16). Structural parsing eliminates this.
  • DescribeGroups with multiple groups caused a retry loop between brokers. Now forwarded once — the client handles per-group errors natively.
  • Proxy retry loop aborted on the first connection failure instead of trying another broker. Connection errors now retry like other transient failures.
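To make the two detection bugs above concrete, here is a small sketch using a Heartbeat response as the example. The helper names are hypothetical, and the layout is this sketch's reading of the Kafka Heartbeat response schema rather than anything taken from the PR: a 4-byte correlation id, a tagged-field section in the header from v4 (the first flexible version), a 4-byte throttle_time_ms from v1, then a 2-byte error_code. The payload is assumed to exclude the 4-byte length prefix.

```python
import struct

NOT_COORDINATOR = 16


def read_uvarint(buf, offset):
    """Decode an unsigned varint; return (value, next_offset)."""
    value = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def skip_tagged_fields(buf, offset):
    """Skip a tagged-field section: a count, then (tag, size, data) entries."""
    count, offset = read_uvarint(buf, offset)
    for _ in range(count):
        _tag, offset = read_uvarint(buf, offset)
        size, offset = read_uvarint(buf, offset)
        offset += size
    return offset


def heartbeat_error_code(payload, api_version):
    """Walk the response structure to find error_code (assumed layout)."""
    offset = 4  # correlation_id
    if api_version >= 4:
        offset = skip_tagged_fields(payload, offset)  # flexible header
    if api_version >= 1:
        offset += 4  # throttle_time_ms
    (code,) = struct.unpack_from(">h", payload, offset)
    return code


def naive_scan(payload):
    """The false-positive pattern: look for the bytes of 16 anywhere."""
    return b"\x00\x10" in payload


def fixed_offset_code(payload):
    """The hard-coded-offset pattern: assume the non-flexible v1 layout."""
    (code,) = struct.unpack_from(">h", payload, 8)
    return code


# v1 success response whose throttle_time_ms happens to be 16.
v1_ok = struct.pack(">iih", 1, 16, 0)
# v4 NOT_COORDINATOR response with empty tagged-field sections.
v4_not_coordinator = struct.pack(">iBihB", 7, 0, 0, NOT_COORDINATOR, 0)
```

With these two payloads, `naive_scan(v1_ok)` flags a successful response because the throttle value contains the same bytes, and `fixed_offset_code(v4_not_coordinator)` misses the real error because the header's tagged-field byte shifts every later field by one. `heartbeat_error_code` handles both cases because it follows the structure for the given version.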

…split brain

Closes KafScale#121.

Brokers now acquire an etcd lease before coordinating any group operation.
Only the lease holder can coordinate — others reject with NOT_COORDINATOR
and the proxy retries on the correct broker.

Also fixes: silent error swallowing on transient etcd failures, wrong
byte-offset NOT_COORDINATOR detection on flexible protocol versions,
false-positive byte scanning, DescribeGroups multi-group retry loop,
and connect-failure retry abort in the proxy.

@novatechflow
Collaborator

Nice one! Thanks :)

@novatechflow novatechflow merged commit 8f7b489 into KafScale:main Mar 2, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

bug: consumer groups with more than 1 consumer leads to GroupCoordinator split brain